Challenges for Discontiguous Phrase Extraction

نویسندگان

  • Dale Gerdemann
  • Gaston Burek
چکیده

Suggestions are made as to how phrase extraction algorithms should be adapted to handle gapped phrases. Such variable phrases are useful for many purposes, including the characterization of learner texts. The basic problem is that there is a combinatorial explosion of such phrases. Any reasonable program must start by putting the exponentially many phrases into equivalence classes (Yamamoto and Church, 2001). This paper discusses the proper characterization of gappy phrases and sketches a suffix-array algorithm for discovering

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Unsupervised Syntax-Based Machine Translation: The Contribution of Discontiguous Phrases

We present a new unsupervised syntax-based MT system, termed U-DOT, which uses the unsupervised U-DOP model for learning paired trees, and which computes the most probable target sentence from the relative frequencies of paired subtrees. We test U-DOT on the German-English Europarl corpus, showing that it outperforms the state-of-the-art phrase-based Pharaoh system. We demonstrate that the incl...

متن کامل

String-to-Tree Multi Bottom-up Tree Transducers

We achieve significant improvements in several syntax-based machine translation experiments using a string-to-tree variant of multi bottom-up tree transducers. Our new parameterized rule extraction algorithm extracts string-to-tree rules that can be discontiguous and non-minimal in contrast to existing algorithms for the tree-to-tree setting. The obtained models significantly outperform the str...

متن کامل

A Study of Translation Rule Classification for Syntax-based Statistical Machine Translation

Recently, numerous statistical machine translation models which can utilize various kinds of translation rules are proposed. In these models, not only the conventional syntactic rules but also the non-syntactic rules can be applied. Even the pure phrase rules are includes in some of these models. Although the better performances are reported over the conventional phrase model and syntax model, ...

متن کامل

Fast extraction of discontiguous sequences in text: a new approach based on maximal frequent sequences

In this paper, we present a new technique for the extraction of discontiguous sequential descriptors from text. We are able to form word sequences without any restriction on their size or on the distance between their components. Based on the concept of a maximal frequent sequence (MFS), our approach allows for the extraction of compact text descriptors of quality in a more efficient manner tha...

متن کامل

مدل ترجمه عبارت-مرزی با استفاده از برچسب‌های کم‌عمق نحوی

Phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target side phrases of training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010